The Backwards Arrow of Time of the Coherently Bayesian Statistical Mechanic 
Many physicists think that the maximum entropy formalism is a straightforward application of 
Bayesian statistical ideas to statistical mechanics. Some even say that statistical mechanics is just 
the general Bayesian logic of inductive inference applied to large mechanical systems. This approach 
identifies thermodynamic entropy with the information-theoretic uncertainty of an (ideal) observer's 
subjective distribution over a system's microstates. In this brief note, I show that this postulate, plus 
the standard Bayesian procedure for updating probabilities, implies that the entropy of a classical 
system is monotonically non-increasing on the average — the Bayesian statistical mechanic's arrow of 
time points backwards. Avoiding this unphysical conclusion requires rejecting the ordinary equations 
of motion, or practicing an incoherent form of statistical inference, or rejecting the identification of 
uncertainty and thermodynamic entropy. 
Recent years have seen renewed interest in connections between physics and statistics [1]. Of
particular interest, naturally, has been the connection between statistical mechanics and
statistical inference. The subjectivist approach to statistical mechanics, ably advocated in
recent times by Jaynes and his school, holds that probabilities represent degrees of belief;
specifically, the probability of a microstate is the degree to which an ideal observer should
believe the system is in that state, given the evidence available, and the entropy of the system
is that observer's uncertainty as to the microstate. The theory governing the coherent use of
subjective probabilities is called Bayesian statistics [?]. Jaynes, in particular, claimed that
statistical mechanics is just an application of the general logic of Bayesian inference. The
validity of statistical mechanics would then be independent of such tricky dynamical properties
as ergodicity, mixing, etc., which on other interpretations are vital. For subjectivists, the
intensive study of the ergodic properties of mechanical systems is simply time wasted.

While controversial [4], the Bayesian vision of statistical mechanics is powerful and appealing.
However, it has a flaw which has not been pointed out before. The second law says that the
entropy of a closed system is non-decreasing; this provides the arrow of time. In Sec. I, I
prove that, equating thermodynamic entropy with subjective uncertainty, ordinary Bayesian
inference implies that entropy is non-increasing over time, at least on average and sometimes
strictly. (Sec. I A investigates the long-run behavior of the distribution under Bayesian
updating; it is ancillary to the main line of argument.) This is completely unphysical, so
Sec. II examines the proof's assumptions. There are strong arguments that any coherent use of
subjective probabilities must employ Bayesian updating. In any case, replacing it by repeatedly
applying the maximum entropy principle still reverses the arrow of time. This forces a choice
between inconsistent, ad hoc rules of statistical inference, or abandoning the equation of
physical entropy with uncertainty.

The rest of this introduction fixes notation, following [?], and makes explicit some innocent
assumptions.

Start with a classical mechanical system with a phase space $\Gamma$; write $x$ for a point in
this phase space. $F$ is a distribution on $\Gamma$, representing an (ideal) observer's
uncertainty about the microscopic state $x$. For simplicity, assume this distribution has a
density $f$ (i.e., is absolutely continuous with respect to Lebesgue measure on $\Gamma$).
Denote expectation by angle brackets, so the mean of $M$ is $\langle M \rangle$; subscripts will
specify the distribution used in the expectation when necessary, i.e. $\langle M \rangle_F$ is
the mean of $M$ under distribution $F$. $\langle M | N \rangle$ is the expectation of $M$
conditional on $N$.

The system's equations of motion lead to a discrete-time evolution operator $T$ on $\Gamma$,
which I will assume is non-singular. (Everything still works in continuous time, but needs more
symbols.) By a slight abuse of notation, $T$ also denotes the induced Frobenius-Perron operator
taking distributions on $\Gamma$ into new distributions on $\Gamma$. The specification of $T$
also induces an evolution operator for observables, the Koopman operator $U$, such that
$U\phi(x) = \phi(T(x))$ for any sufficiently well-behaved ($L^\infty$) function $\phi$. From
this definition, it can be seen that $\langle U\phi \rangle_F = \langle \phi \rangle_{TF}$. (The
difference between the Frobenius-Perron and Koopman operators is like that between the
Schrödinger and Heisenberg pictures, respectively.)
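
The duality between the two operators can be checked numerically in a toy setting. The sketch below assumes a finite state space on which $T$ acts as a permutation, a discrete stand-in for the phase-space map; the function names and the random test data are illustrative assumptions, not part of the original argument.

```python
import random

def push_forward(perm, f):
    """Frobenius-Perron operator for a permutation: mass at x moves to perm[x]."""
    g = [0.0] * len(f)
    for x, p in enumerate(f):
        g[perm[x]] += p
    return g

def koopman(perm, phi):
    """Koopman operator: (U phi)(x) = phi(T(x))."""
    return [phi[perm[x]] for x in range(len(phi))]

def mean(phi, f):
    """Expectation of the observable phi under the distribution f."""
    return sum(p * v for p, v in zip(f, phi))

rng = random.Random(0)
n = 8
perm = list(range(n))
rng.shuffle(perm)                      # hypothetical invertible dynamics T
weights = [rng.random() for _ in range(n)]
f = [w / sum(weights) for w in weights]         # distribution F
phi = [rng.uniform(-1, 1) for _ in range(n)]    # observable phi

lhs = mean(koopman(perm, phi), f)       # <U phi>_F
rhs = mean(phi, push_forward(perm, f))  # <phi>_{TF}
print(lhs, rhs)
```

Both expectations are the same sum over states, taken in a different order, so they agree up to floating-point rounding.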

There is, in addition to the microscopic degrees of freedom, a set of macroscopic degrees of
freedom, collectively $M$. These observables depend only on the present microscopic state,
though possibly noisily, through some observation density $p(M = m | X = x)$.

Write $F_0$ for the initial distribution, and $F_t$ for the distribution at time $t$. The
distribution $F_0$ may be derived via a maximum-entropy procedure, starting from an initial
observation $M(0) = m_0$ [?]. However, it really doesn't matter where $F_0$ comes from, or what
form it takes. Finally, write $H[F]$ for the Shannon entropy of the distribution $F$, i.e.,

$$ H[F] = -\int f(x) \log f(x) \, dx \tag{1} $$

The information content of a random variable $X$ is defined to be the entropy of its
distribution function, which for convenience will also be written $H[X]$; it should always be
clear which is meant. Conditional information content, $H[X | Y = y]$, is the entropy of the
conditional distribution.



I. DERIVATION OF THE BACKWARDS ARROW

So far, I have either been fixing notation, or making 
assumptions which are common to all approaches to sta- 
tistical mechanics, and so presumably innocuous. I now 
make three explicit and substantive assumptions. 

I The evolution operator $T$ is invertible.

II The probability distribution over microstates gets updated by the usual application of
Bayes's rule, $p(X = x | Y = y) = p(Y = y | X = x) \, p(X = x) / p(Y = y)$.

III The thermodynamic entropy at time $t$, $S_t$, is equal to $H[F_t]$.

These assumptions reverse the arrow of time, i.e., they 
make entropy non-increasing. 

Begin with the initial distribution over microstates, $F_0$. After one time step, this is
transformed to a new distribution, $TF_0$. From a Bayesian perspective, this does not represent
a change in our knowledge of the system, merely keeping our predictions up to date. (Rather than
updating the distribution, we could use the Koopman operator to update observables.) It is a
well-known consequence of assumption I that $H[TF_0] = H[F_0]$, i.e., that conservative dynamics
are entropy-preserving [5, Theorem 9.3.1]. However, we now make a new measurement of the
macroscopic observable $M$, getting the value $m_1$. Then, via assumption II, Bayes's rule gives
us a new distribution:

$$ f_1(x) = \frac{p(m_1|x) \, Tf_0(x)}{\int_\Gamma p(m_1|x) \, Tf_0(x) \, dx} \tag{2} $$

Now, $f_1(x)$ is simply the density of $X_1$ conditional on $M_1 = m_1$. So
$H[F_1] = H[X_1 | M_1 = m_1]$. An elementary inequality of information theory [6] tells us that
"conditioning reduces entropy"; specifically,
$H[X|M] \equiv \langle H[X | M = m] \rangle \leq H[X]$, with equality if and only if $X$ and $M$
are statistically independent. Using assumption III to identify the thermodynamic entropy $S_t$
and the Shannon information $H[F_t]$,

$$ \langle S_1 \rangle = \langle H[F_1] \rangle \leq H[F_0] = S_0 \tag{3} $$



Thus, unless the macroscopic observable is in fact merely 
noise, the entropy decreases on average. While there may 
be values of m which are so uninformative they increase 
an observer's uncertainty about the microscopic state, on 
average every observation helps narrow that uncertainty. 
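
The two steps of this argument, entropy-preserving evolution followed by an entropy-reducing (on average) Bayesian update, can be illustrated with a small numerical sketch. It assumes a finite state space with a permutation as the invertible dynamics and a randomly generated noisy observation model; discrete Shannon entropy stands in for the differential entropy of the text, and all names and numbers are hypothetical.

```python
import math
import random

def entropy(f):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in f if p > 0)

def push_forward(perm, f):
    """Evolve a distribution under the permutation perm (x -> perm[x])."""
    g = [0.0] * len(f)
    for x, p in enumerate(f):
        g[perm[x]] += p
    return g

def bayes_update(prior, likelihood):
    """Return the posterior and the marginal probability of the observation."""
    joint = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence

rng = random.Random(1)
n_states, n_outcomes = 12, 3

perm = list(range(n_states))
rng.shuffle(perm)                                   # invertible dynamics T
weights = [rng.random() for _ in range(n_states)]
f0 = [w / sum(weights) for w in weights]            # initial distribution F_0

# lik[m][x] = p(M = m | X = x); each column is normalized over outcomes m.
raw = [[rng.random() for _ in range(n_states)] for _ in range(n_outcomes)]
col = [sum(raw[m][x] for m in range(n_outcomes)) for x in range(n_states)]
lik = [[raw[m][x] / col[x] for x in range(n_states)] for m in range(n_outcomes)]

tf0 = push_forward(perm, f0)
s0 = entropy(f0)
s_evolved = entropy(tf0)          # equals s0: permutations preserve entropy

mean_s1 = 0.0                     # <H[F_1]>, averaged over outcomes m_1
for m in range(n_outcomes):
    f1, p_m = bayes_update(tf0, lik[m])
    mean_s1 += p_m * entropy(f1)

print(s0, s_evolved, mean_s1)
```

Individual outcomes may raise or lower the entropy; only the average over outcomes, weighted by their marginal probabilities, is guaranteed not to exceed the prior entropy.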

A stronger result follows from the common idealization that observables are deterministic
functions of microscopic state, $M(x) = m$. In this case, $p(m|x)$ is either 0 or 1, depending
on whether $M(x) = m_1$ or not. Thus
$f_1(x) = Tf_0(x) \, 1_{M^{-1}(m_1)}(x) / TF_0(M^{-1}(m_1))$, i.e., the truncation of $TF_0$ to
the part of $\Gamma$ compatible with the macroscopic observation. Unless $M^{-1}(m_1)$ includes
the entire support of $TF_0$, $F_1$ is a more concentrated measure than $TF_0$ or $F_0$, and so
the entropy has strictly decreased, and not just on average. 1

Under repeated measurements, the entropy is non-increasing, either on average or strictly,
depending on whether the measurements are noisy or not. (Entropy is constant between
observations.) In the case of discrete-valued deterministic measurements, if the measurement
partition is "generating" [?, ?], then the volume of $\Gamma$ compatible with a sequence of
measurements shrinks towards zero, and so the uncertainty, as measured by the Shannon
information, tends to $-\infty$. This is not necessarily the case if the measurement partition
is not generating.
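
A minimal discrete sketch of the generating-partition case: the states are N-bit words, the invertible dynamics is a cyclic shift, and the deterministic observable reads off the leading bit. These choices are assumptions made for illustration. Because the state space is finite, the entropy bottoms out at zero rather than tending to $-\infty$ as it would for a density on a continuous phase space.

```python
import math

N = 6                      # number of bits in a microstate (hypothetical)
MASK = (1 << N) - 1

def rotate(x):
    """Invertible dynamics T: cyclic left shift of an N-bit word."""
    return ((x << 1) & MASK) | (x >> (N - 1))

def observe(x):
    """Deterministic macroscopic observable M(x): the leading bit."""
    return x >> (N - 1)

def entropy(dist):
    """Shannon entropy (in nats) of a dict mapping state -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def step(dist, m):
    """Evolve with T, then condition on M = m by truncating and renormalizing."""
    evolved = {rotate(x): p for x, p in dist.items()}
    kept = {x: p for x, p in evolved.items() if observe(x) == m}
    z = sum(kept.values())
    return {x: p / z for x, p in kept.items()}

true_state = 0b101101                             # the actual microstate
dist = {x: 1.0 / 2**N for x in range(2**N)}       # uniform initial F_0
entropies = [entropy(dist)]
for _ in range(N):
    true_state = rotate(true_state)
    dist = step(dist, observe(true_state))
    entropies.append(entropy(dist))

print(entropies)
```

Each observation reveals one previously unseen bit, so the set of compatible states halves at every step and the observer's distribution ends up concentrated on the true state.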

Note that I required nothing of the dynamics other
than assumption I, invertibility. In particular, I did not
need chaos, ergodicity, mixing, etc., either at the micro- 
scopic or macroscopic level. Thermodynamic equilibrium 
or its absence is also irrelevant. 



A. Long-Run Behavior of $H[F_t]$

Describing the long-run behavior of $H[F_t]$ requires explicit use of measure-theoretic
probability [9], and what's called "Doob's martingale". As a measurable space, $\Gamma$ comes
with a $\sigma$-algebra of measurable sets $\mathcal{G}$. Let $G$ be any set in $\mathcal{G}$.
Then $1_G(X)$ is a random variable, indicating whether or not $X \in G$, and
$\langle 1_G \rangle_{F_0} = F_0(G)$, the probability of the set $G$ under distribution $F_0$.
Let $\mathcal{M}_t = \sigma(M_1, \ldots M_t)$, the smallest $\sigma$-algebra with respect to
which all the observables up to $M_t$ are measurable, and examine
$\langle 1_G | \mathcal{M}_t \rangle_{F_0}$, the conditional expectation of the indicator
variable for $G$. Clearly,
$\langle 1_G | \mathcal{M}_t \rangle_{F_0} = \langle 1_G \rangle_{F_t} = F_t(G)$. $F_t(G)$ is a
martingale and converges almost surely and in mean square to a random variable $F_\infty(G)$,
which is the conditional expectation of $1_G$ with respect to $\mathcal{M}_\infty$, the smallest
$\sigma$-algebra containing all the $\mathcal{M}_t$ [9, §6.6]. Thus, the conditional measures
$F_t$ converge weakly on a limit $F_\infty$ [?, §7.1]. Thus, the entropy $H[F_t]$ also converges
on a limiting value. If $\mathcal{M}_\infty = \mathcal{G}$, as in the generating partition case,
then, for any set $G$, $F_\infty(G) = 0$ or $= 1$, and $H[F_\infty] = -\infty$. If
$\mathcal{M}_\infty \subset \mathcal{G}$, then the conditional distributions converge weakly on
a distribution with a finite entropy. 2

1 The most important case where $\mathrm{supp}\, TF_0 \subseteq M^{-1}(m_1)$ is when $M$ is a
constant of the motion, e.g., total energy for a Hamiltonian system. Entropy is then constant
after the first measurement, even if the system begins arbitrarily far from equilibrium.

A somewhat more refined result is possible if we assume that the asymptotic equipartition
property of information theory holds (i.e., that the Shannon-McMillan-Breiman theorem applies).
This leads to estimates of the asymptotic growth rates for likelihoods, and so for posterior
probabilities in Bayes's rule.

Suppose that the following limits exist for every $x, y \in \Gamma$:

$$ h(x) \equiv \lim_{n\to\infty} H[M_n | M_{n-1}, M_{n-2}, \ldots M_1, X = x] \tag{4} $$

$$ = \lim_{n\to\infty} -\frac{1}{n} \int d^n m \; p(m_1^n|x) \log p(m_1^n|x) \tag{5} $$

$$ d(x,y) \equiv \lim_{n\to\infty} -\frac{1}{n} \int d^n m \; p(m_1^n|y) \log \frac{p(m_1^n|x)}{p(m_1^n|y)} \tag{6} $$

where $p(m_1^n|x)$ abbreviates $p(M_1 = m_1, M_2 = m_2, \ldots M_n = m_n | X_0 = x)$. The
quantity $h(x)$ is the macroscopic entropy rate at $x$ (not to be confused with the microscopic
rate of entropy production [?]). $d(x,y)$ is the macroscopic relative entropy rate, or
Kullback-Leibler divergence rate, between $x$ and $y$. Note that $d(x,y) \geq 0$, and that
$d(x,y) = 0$ if and only if $p(M_n = m_n | M_1^{n-1} = m_1^{n-1}, X_0 = y)$ and
$p(M_n = m_n | M_1^{n-1} = m_1^{n-1}, X_0 = x)$ converge for almost all $m_1^\infty$.

Further assume that the asymptotic equipartition property [?] holds, so that, if $X = y$, then
for $F_0$-almost-all $x$

$$ \lim_{n\to\infty} -\frac{1}{n} \log p(m_1^n|x) = h(y) + d(x,y) \tag{7} $$

almost surely. An immediate corollary of Eq. 7 is that

$$ \log p(m_1^n|x) = -n h(y) - n d(x,y) + g(x,y,m_1^n) \tag{8} $$

where $g$ is a random quantity which is $o(n)$ almost surely.

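Write Bayes's rule with $n$ observations in logarithmic form, and substitute in Eq. 8 (assuming
$f_0(x) > 0$):

$$ \log \frac{f_n(x)}{f_0(x)} = \log p(m_1^n|x) - \log \langle p(m_1^n|x) \rangle_{F_0} \tag{9} $$

$$ = -n h(y) - n d(x,y) + g(x,y,m_1^n) - \log \left\langle e^{-n h(y) - n d(x,y) + g(x,y,m_1^n)} \right\rangle_{F_0} \tag{10} $$

$$ = -n h(y) - n d(x,y) + g(x,y,m_1^n) - \log e^{-n h(y)} \left\langle e^{-n d(x,y)} e^{g(x,y,m_1^n)} \right\rangle_{F_0} \tag{11} $$

$$ = -n d(x,y) + g(x,y,m_1^n) - \gamma(y,m_1^n) - \log \left\langle e^{-n d(x,y)} \right\rangle_{F_0} \tag{12} $$

$\gamma$ is another $o(n)$ random quantity; for later use, set
$\eta(x,y,m_1^n) = g(x,y,m_1^n) - \gamma(y,m_1^n)$.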



$$ \log \left\langle e^{-n d(x,y)} \right\rangle_{F_0} = \log \int dx \, f_0(x) \left( e^{-d(x,y)} \right)^n \tag{13} $$

$$ = \log \left( \left\| e^{-d(x,y)} \right\|_{n,F_0} \right)^n \tag{14} $$

$$ = n \log \left\| e^{-d(x,y)} \right\|_{n,F_0} \tag{15} $$

$$ \equiv -n d_n(y) \tag{16} $$

The last line defines $d_n(y)$, and $\| e^{-d(x,y)} \|_{n,F_0}$ is the $L_n$ norm of the
function $e^{-d(x,y)}$ with respect to the measure $F_0$. The latter is non-decreasing in $n$,
and its limit is the essential supremum of the function. That is,
$\| e^{-d(x,y)} \|_{\infty,F_0}$ is the smallest $u$ such that $u \geq e^{-d(x,y)}$ for all $x$,
except on a set of $F_0$-probability zero. It follows that $d_\infty(y)$ is the essential
infimum of $d(x,y)$, and so $d_\infty(y) \leq d(x,y)$ everywhere except on a set of
$F_0$-probability zero. Since $d(x,y) \geq 0$, we can be sure that $d_\infty(y)$ is at least
zero. Since $d(y,y) = 0$, if $f_0$ is positive in every sufficiently small neighborhood of $y$,
we will have $d_\infty(y) = 0$. Since the procedures used to construct prior distributions for
statistical mechanics generally give non-vanishing weight to all physically accessible regions
of the phase space, this last assumption is reasonable, and so set $d_\infty(y) = 0$.
Substituting back in to Eq. 12,

$$ \log \frac{f_n(x)}{f_0(x)} = -n d(x,y) + \eta(x,y,m_1^n) + n d_n(y) \tag{17} $$

$$ = n \left( d_n(y) - d(x,y) \right) + \eta(x,y,m_1^n) \tag{18} $$

Taking the limit as $n \to \infty$,

$$ \lim_{n\to\infty} \frac{1}{n} \log \frac{f_n(x)}{f_0(x)} = d_\infty(y) - d(x,y) \tag{19} $$

$$ = -d(x,y) \tag{20} $$

Asymptotically, therefore, $f_n(x)$ shrinks exponentially fast towards zero, unless
$d(x,y) = 0$. Setting $D(y) = \{x \,|\, d(x,y) = 0\}$, we see that $F_\infty(D(y)) = 1$, and so
$H[F_\infty]$ is at most the logarithm of the volume of $D(y)$.
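
The exponential rate in Eq. 20 can be seen in a deliberately simple sketch. It assumes, purely for illustration, that the observations are i.i.d. Bernoulli draws whose success probability plays the role of the unknown state $x$, so that $d(x,y)$ reduces to the Kullback-Leibler divergence between two Bernoulli distributions. To keep the example deterministic, the observation sequence is taken to be exactly "typical" for the true value $y$ (a fraction $y$ of ones), which makes the $o(n)$ term $g$ vanish. The grid, prior, and sample size are hypothetical.

```python
import math

def log_lik(x, ones, n):
    """log p(m_1^n | x) for n Bernoulli(x) observations containing `ones` ones."""
    return ones * math.log(x) + (n - ones) * math.log(1 - x)

def kl_bernoulli(y, x):
    """d(x, y): divergence of Bernoulli(x) from the true Bernoulli(y)."""
    return y * math.log(y / x) + (1 - y) * math.log((1 - y) / (1 - x))

def posterior_rates(grid, prior, ones, n):
    """Return (1/n) log(f_n(x) / f_0(x)) for each x on the grid."""
    ll = [log_lik(x, ones, n) for x in grid]
    # log <p(m_1^n | x)>_{F_0}, via log-sum-exp to avoid underflow
    top = max(ll)
    log_evidence = top + math.log(sum(p * math.exp(l - top) for p, l in zip(prior, ll)))
    return [(l - log_evidence) / n for l in ll]

grid = [i / 10 for i in range(1, 10)]      # candidate states x
prior = [1.0 / len(grid)] * len(grid)      # f_0 positive at the true value y
y = 0.3
n = 2000
ones = round(n * y)                        # a typical sequence for y

rates = posterior_rates(grid, prior, ones, n)
for x, r in zip(grid, rates):
    print(f"x={x:.1f}  rate={r:+.4f}  -d(x,y)={-kl_bernoulli(y, x):+.4f}")
```

The printed rates track $-d(x,y)$ up to a correction of order $(\log 9)/n$ coming from the prior weight at $y$, and only the state with $d(x,y) = 0$ keeps non-vanishing posterior weight.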



II. WAYS TO AVOID THIS RESULT 



2 By Eq. 3, the sequence $H[F_t]$ forms a supermartingale with respect to the filtration induced
by the macro-variables $M_t$, but the conditions needed to directly apply martingale convergence
theorems, such as $\langle |H[F_t]| \rangle < \infty$, do not necessarily hold.



Since, in reality, thermodynamic entropy is monotonically non-decreasing, at least one of the
assumptions leading to Eq. 3 must be wrong.

A. Assumption I: Invertible Dynamics

Denying assumption I "has all the advantages of theft over honest toil" [13]. The problem of the
foundations of statistical mechanics is precisely that of deriving macroscopic irreversibility
from microscopically-reversible dynamics, and the point of the Bayesian approach was to do so
without making detailed assumptions about those dynamics. That said, crime may not pay; the
conditions needed to get $H[TF] > H[F]$ impose highly non-trivial restrictions on the dynamics
[13]. Not only are conservative Hamiltonian dynamics ruled out, but so is any system of ordinary
differential equations. What is really needed is that $H[TF|M] > H[F]$, and it is hard to see
why the microscopic dynamics should always produce more than enough entropy to off-set the
information provided by whichever macroscopic observables we happen to choose. But without such
cancellation, watching a pot closely enough will keep it from boiling.



B. Assumption II: Bayesian Updating

Explicitly or not, most advocates of Bayesian statistical mechanics reject assumption II
[?, ?]. For instance, Jaynes put forward the following derivation of the second law [?]: Start
with an initial observation of an observable, $M_0 = m_0$. Confine ourselves to distributions
$p$ which have $\langle M \rangle_p = m_0$; call the set of such distributions $C_0$. Let the
member of $C_0$ with the highest entropy be $J_0$; we select this as our initial distribution.
Now let it evolve forward in time, giving $TJ_0$; by assumption I, $H[J_0] = H[TJ_0]$. The time
evolution leads to a certain value for the observable, $m_1 = \langle M \rangle_{TJ_0}$. Now
consider the class of distributions $C_1$ with $\langle M \rangle = m_1$; let the maximum
entropy member of this class be $J_1$. Since $TJ_0 \in C_1$, it follows that
$H[J_1] \geq H[TJ_0] = H[J_0]$. Jaynes then identifies $S_1$ with $H[J_1]$, i.e., he updates the
distribution by re-applying the maximum entropy principle, using only the current observation,
rather than by applying Bayes's rule.

There are good reasons to doubt the wisdom of using probability as a measure of degree of belief
[?]. But if you are going to do that, then the Bayesian way is the right way to do so, and you
need to use conditioning. Failure to do so is incoherent, as the well-known "Dutch Book"
arguments show. ([16] gives a clear introduction; see [?] for details.) In particular, in the
formally very similar problem of nonlinear filtering [?], application of Bayes's rule is
demonstrably optimal, and forgetting all earlier observations is not, regardless of whether one
interprets probability subjectively. 3



3 The doubts raised by [?] about inter-temporal updating are not relevant. There's no time lapse
between $TF_0$ and $F_1$, just the addition of the information that $M_1 = m_1$.



Even if Bayesian statistical mechanics are free to not use conditioning, they still get a
backwards arrow of time. A consistent use of the principle of maximum entropy, given the two
observations $M_0 = m_0$ and $M_1 = m_1$, would go as follows. First, restrict ourselves to
distributions $p$ which satisfy both the constraints $\langle M \rangle_p = m_0$ and
$\langle M \rangle_{Tp} = m_1$. It is awkward to have one constraint on $p$ and another on $Tp$;
using the Koopman operator, we can turn the latter into a constraint on $p$ as well,
$\langle UM \rangle_p = m_1$. Let us write $C_{01}$ for the class of distributions satisfying
these two constraints. Those satisfying the first constraint are the class we called $C_0$
above, and those satisfying the second constraint form a subclass of the set we called $C_1$
above. Hence $C_{01} \subset C_0 \cap C_1$. Then the maximum entropy principle tells us to pick
the distribution $J_{01}$ given by

$$ J_{01} = \arg\max_{p \in C_{01}} H[p] \tag{21} $$

Since $C_{01} \subset C_0$ and $C_{01} \subset C_1$, it is clear that
$H[J_{01}] \leq \min\{H[J_0], H[J_1]\}$. Thus, updating our distribution by maximizing entropy,
rather than conditioning, still reverses the arrow of time: by assumption III,
$S_1 = H[TJ_{01}]$, and by assumption I, $H[TJ_{01}] = H[J_{01}] \leq H[J_0] = S_0$.
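
The effect of adding the historical constraint can be sketched numerically. The example assumes a six-state system, a cyclic shift as the invertible dynamics, and hypothetical measured values $m_0 = 0.3$ and $m_1 = 0.47$; none of these come from the original. The maximum-entropy distributions are found by plain gradient descent on the convex dual problem, which is one simple solver among several that would do.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def mean(g, p):
    """Expectation of the observable g under the distribution p."""
    return sum(q * v for q, v in zip(p, g))

def gibbs(features, lams):
    """Distribution p(x) proportional to exp(sum_i lams[i] * features[i][x])."""
    n = len(features[0])
    logw = [sum(l * g[x] for l, g in zip(lams, features)) for x in range(n)]
    top = max(logw)
    w = [math.exp(v - top) for v in logw]
    z = sum(w)
    return [v / z for v in w]

def max_entropy(features, targets, steps=20000, rate=1.0):
    """Maximum-entropy distribution with <features[i]> = targets[i].

    Gradient descent on the dual log Z(lams) - lams . targets, whose
    gradient is the gap between current and target expectations."""
    lams = [0.0] * len(features)
    for _ in range(steps):
        p = gibbs(features, lams)
        lams = [l - rate * (mean(g, p) - c)
                for l, g, c in zip(lams, features, targets)]
    return gibbs(features, lams)

n = 6
M = [x / (n - 1) for x in range(n)]       # observable M(x) = 0, 0.2, ..., 1
perm = [(x + 1) % n for x in range(n)]    # invertible dynamics T: cyclic shift
UM = [M[perm[x]] for x in range(n)]       # Koopman operator applied to M

m0, m1 = 0.3, 0.47                        # hypothetical measured values

J0 = max_entropy([M], [m0])               # constraint <M> = m0 only
J1 = max_entropy([M], [m1])               # constraint <M> = m1 only
J01 = max_entropy([M, UM], [m0, m1])      # both <M> = m0 and <UM> = m1

h0, h1, h01 = entropy(J0), entropy(J1), entropy(J01)
print(h0, h1, h01)
```

With these numbers the doubly constrained distribution has lower entropy than either singly constrained one. Choosing $m_1$ equal to the value predicted from $J_0$ would instead give $J_{01} = J_0$ and equality with $H[J_0]$.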

To avoid getting the direction of the arrow of time 
backwards, the Bayesian or Jaynesian statistical me- 
chanic must ignore the known prior history, a procedure 
quite without statistical justification. By use of the op- 
erator U, constraints on a single observable over multiple 
times can be converted into constraints on multiple ob- 
servables at a single time, which we are normally told 
must all be incorporated into the distribution F. There 
does not seem to be any reason why it should be legiti- 
mate to take M as a constraint in the maximum entropy 
procedure, but not $UM$. Worse yet, under some circumstances subjectivists (e.g., Jaynes, in his
discussion of spin-echo experiments [2, 3]) have been explicit about needing to incorporate
historical information in order to avoid unphysical predictions.

Whether we update via conditioning, or by applying 
the maximum entropy principle, we get unphysical re- 
sults for the entropy, and can avoid them only by incon- 
sistency about whether historical data counts, or, equiv- 
alently, whether all observables must inform the postu- 
lated distribution. One might argue that, for most sys- 
tems of interest, the distribution obtained from applying 
the maximum entropy principle only to ordinary observ- 
ables at the current time leads to nearly the same predic- 
tions as the coherent procedures, but the former is much 
easier to calculate than the latter, particularly if the dynamics are very irregular. In such a
case, the complexity of computing the $n$th iterate of the evolution operator, $T^n$, may grow
rapidly with $n$, so that a computationally-limited agent, acting under time pressure, might
prefer an approximation which neglects historical data to an exact but intractable update. The
validity of such an approximation would depend on the ergodic properties of the dynamics (e.g.,
mixing), and it is precisely to avoid such dependence that the Bayesian approach was introduced.
Worse, it would lead to a novel kind of Maxwell's demon, a purely passive observer who can make
the entropy decrease during a time interval which depends on the observer's processing speed and
the time-complexity of computing $T^n$. In any event, none of this would explain why the
thermodynamic entropy should match the entropy of this approximate distribution.

C. Assumption III: Thermodynamic Entropy Is Subjective Uncertainty

Assumption III is that thermodynamic entropy $S$ is the information-theoretic uncertainty
$H[F]$. Denying this seems to me a completely satisfactory option. Macroscopically, entropy is
defined by its relations to the observables of heat and temperature. Microscopically, assuming
the usual representation of phase space, entropy is the logarithm of the volume in phase space
compatible with the current macroscopic state [19, 20, 21]; more generally it is the logarithm
of the measure of that region.

Rejection of assumption III is perfectly compatible with accepting a Bayesian, subjectivist
interpretation of probability.



III. CONCLUSION 



A backwards arrow of time follows directly from the 
combination of assumptions I, II and III, at least one of 
which must therefore be rejected. Rejecting assumption I, invertible microphysical dynamics,
entails considerable modification of basic physics. Rejecting assumption II, updating subjective
probabilities via Bayes's rule, is actually insufficient; one must also reject the principle of
maximum entropy, or at the very least apply it in an incoherent way, sometimes taking into
account all observational constraints, sometimes not. Rejecting assumption III, the
identification of thermodynamic entropy with the Shannon information $H[F]$, seems to lead to
the least trouble. I do not pretend that only one choice among these alternatives is defensible,
but some choice is necessary.